Zurich by the Numbers - Predictive Insights into Tourism Dynamic

Authors
Affiliation

Name I, First Name I

University of Lausanne

Name II, First Name II

Published

May 15, 2024

Abstract

The following Forecasting project focuses on applying forecasting techniques to predict tourism trends in Zurich. This analysis aims to harness the power of historical data combined with forecasting algorithms to provide actionable insights into future tourism patterns. We engage in comprehensive data preparation, explore various predictive models, and conduct a detailed evaluation of their forecasting accuracy. The project encapsulates the challenge of turning complex data into understandable and strategic information, crucial for effective decision-making in Zurich’s tourism sector.

1 Exploration & Visualization

1.1 Objectives

The main objectives of this project is to predict :

  • The overnight stays of the visitors in Vaud, from October 2023 until December 2024.

  • The overnight satys of visitors from Philippines to Zürich, from October 2023 until December 2024.

2 DATA

2.1 Cleaning & Wrangling

2.1.1 Tourism Data - General Overview

The dataset contains information about the overnights stays by tourists in the various Swiss cantons. It indicates the tourist’s country of origin, the canton of stay, the month, the year and the total number of overnights stays.

As the dataset provided is in German, we have translated the data in English to make it more intuitive and understandable for everyone. Then, we created a new ‘Date’ column, year-month-day, which corresponds to the correct format to be able to make predictions.

Click to show code
# Load the data in folder data named Dataset_tourism.xlsx)
tourism_data <- readxl::read_xlsx(here("data/Dataset_tourism.xlsx"))

#removing value 'Herkunftsland - Total' in column 'Herkunftsland' as it is just the total
#tourism_data <- tourism_data %>% filter(Herkunftsland != "Herkunftsland - Total")
#print unique values in month column
unique(tourism_data$Monat)
#>  [1] "Januar"    "Februar"   "März"      "April"     "Mai"      
#>  [6] "Juni"      "Juli"      "August"    "September" "Oktober"  
#> [11] "November"  "Dezember"
# change ' [1] "Januar"    "Februar"   "März"      "April"     "Mai"       "Juni"      "Juli"      "August" "September" "Oktober"   "November"  "Dezember" into english month'
tourism_data$Monat <- tourism_data$Monat %>% recode_factor(
  "Januar" = "January",
  "Februar" = "February",
  "März" = "March",
  "April" = "April",
  "Mai" = "May",
  "Juni" = "June",
  "Juli" = "July",
  "August" = "August",
  "September" = "September",
  "Oktober" = "October",
  "November" = "November",
  "Dezember" = "December"
)
#add date type column for plotting purposes
tourism_data <- tourism_data %>% mutate(Date = dmy(paste("01", Monat, Jahr)))
# filtering out 'Herkunftsland - Total' in column 'Herkunftsland' as it is just the total
tourism_data_no_total <- tourism_data %>% filter(Herkunftsland != "Herkunftsland - Total")
#check for NAN
sum(is.na(tourism_data_no_total))
#> [1] 51395
#analyse the NAN values, where are they
(tourism_data_no_total %>% filter(is.na(value)))
#> # A tibble: 51,395 x 6
#>    Herkunftsland                  Kanton Monat Jahr  value Date      
#>    <chr>                          <chr>  <fct> <chr> <dbl> <date>    
#>  1 Malta                          Schwe~ Janu~ 2005     NA 2005-01-01
#>  2 Zypern                         Schwe~ Janu~ 2005     NA 2005-01-01
#>  3 Mexiko                         Schwe~ Janu~ 2005     NA 2005-01-01
#>  4 Übriges Zentralamerika, Karib~ Schwe~ Janu~ 2005     NA 2005-01-01
#>  5 Bahrain                        Schwe~ Janu~ 2005     NA 2005-01-01
#>  6 Katar                          Schwe~ Janu~ 2005     NA 2005-01-01
#>  7 Kuwait                         Schwe~ Janu~ 2005     NA 2005-01-01
#>  8 Australien                     Schwe~ Janu~ 2005     NA 2005-01-01
#>  9 Neuseeland, Ozeanien           Schwe~ Janu~ 2005     NA 2005-01-01
#> 10 Oman                           Schwe~ Janu~ 2005     NA 2005-01-01
#> # i 51,385 more rows
#show data using reactable only showing the first 100 rows
reactable::reactable(head(tourism_data_no_total, 1000), searchable = TRUE)

2.1.2 Tourism Data - Vaud

Given the two objectives of the project, we are going to filter the initial dataset in order to keep and analyse only the cantons of interest. We start by filtering the “Kanton” column to keep only the canton of Vaud.

Click to show code
# Filter by canton Vaud 
tourism_vaud <- tourism_data %>% filter(Kanton == "Vaud")
#check for NAN
sum(is.na(tourism_vaud))
#> [1] 1869
#show the data in a table using reactable
reactable::reactable(head(tourism_vaud, 20))

2.1.3 Tourism Data - Zurich

We filtered the “Kanton” column to keep only the canton of Zurich.

Click to show code
#filter column 'Kanton' for Zurich
tourism_data_zurich <- tourism_data_no_total %>% filter(Kanton == "Zürich")
#check for NAN
sum(is.na(tourism_data_zurich))
#> [1] 1869
#analyse the NAN values, where are they
tourism_data_zurich %>% filter(is.na(value))
#> # A tibble: 1,869 x 6
#>    Herkunftsland                  Kanton Monat Jahr  value Date      
#>    <chr>                          <chr>  <fct> <chr> <dbl> <date>    
#>  1 Malta                          Zürich Janu~ 2005     NA 2005-01-01
#>  2 Zypern                         Zürich Janu~ 2005     NA 2005-01-01
#>  3 Mexiko                         Zürich Janu~ 2005     NA 2005-01-01
#>  4 Übriges Zentralamerika, Karib~ Zürich Janu~ 2005     NA 2005-01-01
#>  5 Bahrain                        Zürich Janu~ 2005     NA 2005-01-01
#>  6 Katar                          Zürich Janu~ 2005     NA 2005-01-01
#>  7 Kuwait                         Zürich Janu~ 2005     NA 2005-01-01
#>  8 Australien                     Zürich Janu~ 2005     NA 2005-01-01
#>  9 Neuseeland, Ozeanien           Zürich Janu~ 2005     NA 2005-01-01
#> 10 Oman                           Zürich Janu~ 2005     NA 2005-01-01
#> # i 1,859 more rows

#show the data in a table using reactable
reactable(head(tourism_data_zurich, 1000))

There are 1869 missing values for the two sub-datasets. These missing values come from the ‘value’ column, creating gaps in the time series. We’ll see later how we’re going to process them to do modelling.

2.1.4 Tourism Data - Zurich and Philipines

We are filtering the “Kanton” column and the ‘Herkunftsland’ column, keeping Zurich and Philippinen for the country of origin.

Click to show code
tourism_data_zurich_philippines <- tourism_data_zurich %>% filter(Herkunftsland == "Philippinen")
#show table using reactable
reactable::reactable(tourism_data_zurich_philippines)

Filtering for ‘Philippinen’ solved the problem of missing data we had with all countries of origin. The overnight stays are all included throughout the period.

However there had been missing values, we would have used one of the two ways of dealing with the problem. Firstly, we can simply take the section of data after the last missing value, assuming that there is a long enough series of observations to produce meaningful predictions. Secondly, we can replace the missing values with estimates. To do this, we first fit an ARIMA model to the data containing missing values, and then use the model to interpolate the missing observations.

Click to show code
# #Creating a tsibble with missing values
# data <- tourism_data_zurich_philippines %>%
#   as_tsibble(key = c(Kanton, Herkunftsland, Monat, Jahr)) %>%
#   select(Date, value) %>%
#   fill_gaps()
# 
# # Fit an ARIMA model to data with missing values
# model_fit <- data %>%
#   model(ARIMA(value))
# 
# # Interpolate missing values using the fitted ARIMA model
# filled_data <- model_fit %>%
#   interpolate(data)
# 
# # Print the data with filled in missing values
# print(filled_data)

3 EDA - Vaud

3.1 Visitors from different countries in Vaud

The graph shows the monthly number of overnight stays in Vaud from tourists of different countries. The period runs from January 2005 to September 2023.

Click to show code
# Create the ggplot object
plot_vaud <- tourism_vaud %>% 
  filter(Herkunftsland != 'Herkunftsland - Total') %>%
  ggplot(aes(x = Date, y = value, group = Herkunftsland, color = Herkunftsland,
             text = paste("Country:", Herkunftsland, "Trips:", value))) +  # Added text for tooltip
  geom_line(show.legend = FALSE) + 
  scale_color_viridis_d() +  # Use viridis color palette
  labs(title = "Number of visitors from Each Country to Vaud",
       x = "Date",
       y = "Number of Trips") +
  theme_minimal()

# Convert to an interactive plotly object with specified width and height
interactive_plot <- ggplotly(plot_vaud, tooltip = "text", width = 600, height = 400)


# Adjust plotly settings 
interactive_plot <- interactive_plot %>%
  layout(
    showlegend = FALSE # Remove legend
  )

# Display the interactive plot
interactive_plot

The time plot reveals some interesting features. - According to the graph, tourists come mainly from Switzerland. The second visitor country is France. - There are large dips in the number of overnight stays at the beginning of each year – these are due to holiday effects. - There was a important drop during the period in 2020 – this was due to the COVID pandemic. - For swiss tourists, there is visible increasing trend before and after the pandemic.

This time plot takes the total number of tourists in the canton of Vaud, combining all countries of origin. Here, we can better observe the seasonal pattern in the data. The number of tourists decreases at the end and beginning of each year and increases in the middle of the year during the summer holidays. There is also an increasing trend pattern if we do not take into account the period of the pandemic in 2020 which caused an important drop in travel and therefore tourism in Vaud. We’ll come back to this outlier later. Any forecasts of this serie would need to capture the seasonal pattern, and the fact that the trend is changing over the period.

Graphical view of total number of tourists in canton Vaud :

Click to show code
tourism_vaud_total <- tourism_vaud %>%
  filter(Herkunftsland == 'Herkunftsland - Total') %>%
  select(-c(Herkunftsland, Kanton, Monat, Jahr))

# Create the ggplot object with viridis color palette
plot_vaud_total <- tourism_vaud_total %>%
  ggplot(aes(x = Date, y = value)) +
  geom_line(color = viridis(1)) +  # Use viridis color palette for a single line
  labs(x = "Date", y = "Number of tourists", title = "Total number of tourists in canton Vaud") + 
  theme_minimal()

# Convert to an interactive plotly object with specified width and height
interactive_plot_total <- ggplotly(plot_vaud_total, width = 600, height = 400)

# Adjust plotly settings
interactive_plot_total <- interactive_plot_total %>%
  layout(
    showlegend = FALSE  # Remove legend if any
  )

# Display the interactive plot
interactive_plot_total

3.1.1 Decomposition

We have process an additive decomposition of the time series into three components: trend, seasonality and residual. These components will allow us to understand how they contribute to the variations observed in Swiss tourism data.

Click to show code
# Convert data to a time series object
vaud_ts <- tourism_vaud_total %>%
  arrange(Date) %>%
   # Filtre pour enlever les valeurs NA dans 'Date'
  filter(!is.na(Date)) %>%
  # Ensure data is complete and monthly
  complete(Date = seq.Date(min(Date, na.rm = TRUE), max(Date, na.rm = TRUE), by = "month")) %>%
  replace_na(list(value = 0)) %>%  # Replace NA values if there are any
  # Create a time series object
  with(ts(value, frequency = 12, start = decimal_date(min(Date, na.rm = TRUE))))

# Decompose the time series
vaud_ts %>% decompose() %>% plot()

The main insights from this decomposition reflect what we have already observed.

- A clear upward trend until around 2020, when it peaks before falling sharply as a result of the pandemic and travel restrictions.

- Monthly seasonality, with clear and regular fluctuations due to seasonal factors.

- A stable residual component until 2020. After this period, there is a slight increase in volatility which may indicate that other events are having an impact on this time series which are not captured by the first two components.

4 EDA - Zurich

4.1 Zurich and All visiting countries

The graph shows the monthly number of overnight stays in Zurich from tourists of different countries.

Click to show code
# Preparing the data
#removing value in column 'Herkunftsland' as it is just the whole of Switzerland
data <- tourism_data_zurich %>%
  filter(!is.na(value)) %>%  # Removing rows with NA values in the 'value' column
  mutate(Monat = month(Date, label = TRUE, abbr = TRUE),  # Extract month 
         Jahr = year(Date)) %>%  # Extract year from Date
  group_by(Herkunftsland, Date) %>%  # Group by country and date
  summarise(Trips = sum(value), .groups = 'drop')  # Summing up trips for each country per date

p <- ggplot(data, aes(x = Date, y = Trips, group = Herkunftsland,
                      color = Herkunftsland == "Philippinen",
                      text = paste("Country:", Herkunftsland, "<br>Trips:", Trips))) +  # Added text for tooltip
  geom_line(show.legend = FALSE) +
    scale_color_viridis_d() +  # Use viridis color palette
  labs(title = "Number of Trips from Each Country to Zurich",
       x = "Date",
       y = "Number of Trips") +
  theme_minimal()

# Convert to an interactive plotly object
interactive_plot <- ggplotly(p, tooltip = "text", width = 600, height = 600)

# Adjust plotly settings 
interactive_plot <- interactive_plot %>%
  layout(
    margin = list(l = 60, r = 60, b = 60, t = 80),  # Adjust margins
    showlegend = FALSE  # Show legend
  )

# Display the interactive plot
interactive_plot

As for Vaud, the most frequent visitors to Zurich are Swiss. Germany and United States are the two main foreign countries to visit Zurich. This can be explained by the fact that the canton of Zurich is closer to Germany and therefore easier to reach. The same applies to France with the canton of Vaud. The yellow curve represents the Philippines. The curve is flat and shows a considerably small number of trips from this country over the period. There is a drastic fall in 2020 caused by COVID-19. The pandemic has had a significant impact on the tourism industry worldwide. At first glance, there are regular seasonal peaks for most countries which suggests also the presence of seasonality in tourism in the canton of Zurich.

4.2 Zurich and Philippinens Visitors

This graph shows only visitors from the Philippines, as this is the country of interest in our analysis.

Click to show code
# Use tourism_data_zurich_philippines data to plot the values in y axis and Date in x axis
p <- ggplot(tourism_data_zurich_philippines, aes(x = Date, y = value)) +
  geom_line(color = viridis(1)) +  # Use viridis color palette for a single line
  labs(title = "Number of Trips from Philippines to Zurich",
       x = "Date",
       y = "Number of Trips") +
  theme_minimal()

# Convert to an interactive plotly object with specified width and height
interactive_plot <- ggplotly(p, width = 600, height = 400)

# Adjust plotly settings
interactive_plot <- interactive_plot %>%
  layout(
    showlegend = FALSE  # Remove legend if any
  )

# Display the interactive plot
interactive_plot

4.2.1 Pattern

4.2.1.1 Decomposition

The additive time series decomposition of the monthly overnight stays for tourists coming from the Philippines to the canton of Zurich shows:

Click to show code
# Convert data to a time series object
tourism_ts <- tourism_data_zurich_philippines %>%
  arrange(Date) %>%
  # Ensure data is complete and monthly
  complete(Date = seq.Date(min(Date), max(Date), by = "month")) %>%
  replace_na(list(value = 0)) %>%  # Replace NA values if there are any
  # Create a time series object
  with(ts(value, frequency = 12, start = decimal_date(min(Date))))

# Decompose the time series
decomposed <- decompose(tourism_ts)

# Plot the decomposed components
plot(decomposed)
tourism_ts
#>       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
#> 2005   57   30   46   73   74   73   57   69   70  103   46   47
#> 2006   49   29   55   81   82  113  115   87  100  134   90   52
#> 2007   77   35   50   83  109   94   42   52  112   66   45   48
#> 2008  122   55   58   82   76   58   46   53   58   70   35   45
#> 2009   76   35   95   88   89   90  100   96  111   81   75   46
#> 2010   52   66   83  127  164  125  106  105  139  219  110   70
#> 2011  103   92   86  177  161  189  126  175  169  201  195   72
#> 2012  120  131  138  208  286  191  110  123  227  247  145  110
#> 2013  181   66  159  233  239  182  136  128  156  205  174  104
#> 2014   96   89  138  287  316  223  225  194  326  340  158  230
#> 2015  109  139  239  290  478  295  326  273  320  388  221  269
#> 2016  201  202  297  556  442  517  343  278  387  495  268  383
#> 2017  202  344  282  810  746  437  560  377  459  519  280  513
#> 2018  178  358  368  563  657  536  461  376  446  498  370  518
#> 2019  201  161  209  706  661  731  576  401  442  562  334  749
#> 2020  202  133   76    3    3    9   35   24   18   14   13   22
#> 2021   14    9   10   13   18   18   49   56   78   68   74  132
#> 2022   78   51   85  142  282  516  473  452  578  850  655 1159
#> 2023  466  344  537 1096 1071 1120 1036  586  827

  • An upward trend until around 2020, when it falls sharply because of the pandemic and travel restrictions. The pandemic had a longer effect on the Philippine tourism, which stopped for a longer period (around 2 years or more).
  • Multiple peaks in the seasonal monthly component. These fluctuations are due to their calendar. Philippines start their summer holidays earlier than we do (31 of May - 29 of July) and have longer La Toussaint holidays (5 October - 18 October - 28 October).
  • A residual component with moderate variability which increases from 2020 onwards, indicating the influence of unforeseen or exceptional events (such as the pandemic) that have disrupted the usual models.

4.2.1.2 Seasonality

Seasonal sub-series plot permit to better visualize the monthly fluctuations of each year, from 2005 to 2023.

Click to show code
# Plot the seasonality in one chart
ggseasonplot(tourism_ts, year.labels = TRUE, year.labels.left = TRUE) +
  scale_color_viridis_d() +
  theme_minimal()

Click to show code
# several chart per month to see the seasonality
ggsubseriesplot(tourism_ts) + ylab("Number of tourists") + xlab("Month") + ggtitle("Seasonal subseries plot")

#debug
#better to use gg_subseries to see the seasonality
#tourism_ts %>% gg_subseries(value) + ylab("Number of tourists") + xlab("Month") + ggtitle("Seasonal subseries plot")

The months of May to July and October seem to have visitor peaks, which may indicate a high tourist season during this period. As we saw before, this is due to their calendar. The years 2022 and 2023 show a significant increase in visitor numbers compared with previous years. In particular, the months from May to October 2022 and 2023 show much higher values. This growth may be due to a number of factors, such as a post-pandemic recovery in travel or specific initiatives that have attracted more tourists.

4.2.1.3 Trend

There is an upward trend until around 2020, when it falls sharply because of the pandemic and travel restrictions. The pandemic had a longer effect on the Philippine tourism, which stopped for a longer period (around 2 years or more).

5 Modelling

This part is about building on your knowledge of time series techniques to model your data. You can investigate various models but you should justify in your report your choices regarding these. Pay attention to the conditions that are needed to apply a specific model. Treat also carefully seasonality, outliers, colinearity, covariates, special events, etc. Remember the following steps:

  1. Aggregation choice for hierarchical time series
  2. Model building
  3. Model selection

5.1 Total number of visitors in Vaud

5.1.1 Outliers, Correlation, Colinearity, Covariates, Special Events ?

Questions ?

5.1.2 ETS model

This ETS model generates forecasts for the next two years (h = 24) and takes into account the Trend, Errors and Seasonality present in the time series. These three components are additive, which is why it is an AAA model.

Click to show code
ets_vaud <- ets(vaud_ts, model = "AAA")
forecast_ets_vaud <- forecast(ets_vaud, h = 24) %>% plot(main = "Forecast of visitors in Vaud", xlab = "Date", ylab = "Number of visitors")

5.1.3 ARIMA

The ARIMA model has been adjusted to take account of the seasonality of the data, which is crucial for time series with pronounced seasonal variations, as is the case here.

Click to show code
arima_vaud <- auto.arima(vaud_ts, seasonal = TRUE)

# Generate forecasts for the next 2 years (24 months)
forecast_arima_vaud <- forecast(arima_vaud, h = 24)

# Plot the forecast
plot(forecast_arima_vaud, main = "ARIMA Forecast for Vaud Tourism", xlab = "Date", ylab = "Number of Tourists")

Click to show code
# Fit ARIMA model with specified parametes
arima_model <- arima(vaud_ts, order = c(5, 1, 0), seasonal = list(order = c(0, 1, 1), period = 12))

forecast_a_vaud <- arima_model %>%
  forecast(h = 24)

# # Plot the forecast
# forecast_a_vaud %>%
#   autoplot(data = vaud_tsibble, main = "ARIMA Forecast for Vaud Tourism", ylab = "Number of Tourists")
# 
# #Provide forecast in table
# as.data.frame(forecast_a_vaud) %>% kable(caption = "Forecast for Vaud Tourism") %>%
#   kable_styling(full_width = FALSE)

The graph shows that the seasonality observed in the historical data continues to manifest itself in the future forecasts, with recurring seasonal peaks and troughs. This demonstrates the robustness of the ARIMA model Forecasts for the next two years (24 months) are represented by the blue line. The grey bands around the blue line represent the forecast confidence interval, indicating the range of values within which future values are likely to lie with a certain probability. Concerning the trend, the model predicts that this upward trend will continue.

5.2 Zurich and Philipines

5.2.1 Forecast without dealing with Covid

5.2.1.1 Naive Forecast

The graph shows the historical trend in the number of tourists from the Philippines to Zurich and the forecasts for the next 24 months using the naive model. We took the graph representing the total number of tourists coming from the Philippines to Zurich.

Click to show code
#convert tourism_ts to tsibble
tourism_ts <- tourism_ts %>% as_tsibble()
# Fit a naive model
fit <- tourism_ts %>%
  model(NAIVE = NAIVE(value))
# Forecast the next 2 years periods
forecast <- fit %>%
  forecast(h = 15)
# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposes
plot <- forecast %>%
  autoplot(tourism_ts, alpha = 0.5) +
  labs(title = "Forecast of tourists from Philipines to Zurich",
       x = "Date",
       y = "Number of tourists") + guides(colour = guide_legend(title = "Forecast"))
plot

#get AIC metric
metrics_naive <- fit %>% accuracy()
# Display accuracy metrics in an HTML table
metrics_naive %>%
  kable("html", caption = "Model Accuracy Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Model Accuracy Metrics
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
NAIVE Training 3.44 135 79.2 -18.8 48.3 0.698 0.629 -0.3

This naive model predicts that future values will be equal to the last observed value in the time series. It does not take into account the past events like the pandemic and assumes here that the levels observed after this extreme fall will remain unchanged. The model does not take into account trends or seasonality neither, which are very present in our case. It’s a simplified approach.

The blue areas represent the 80% (darker) and 95% (lighter) confidence intervals of the forecasts. The wider the interval, the greater the uncertainty about the long-term forecasts which is the case here.

5.2.1.2 Exponential Smoothing

  • Why additive errors ? Because the variance of the errors is constant over time no ?
Click to show code
# Fit an ETS model
# Adjusting the model parameters according to the characteristics of the data
# Here "A" means additive error, "N" means no trend, and "N" means no seasonality
# change these if needed
fit <- tourism_ts %>%
  model(ETS_M_seaso = ETS(value ~ error("A") + trend("A") + season("A"))) #multiplicative seasonality)
# Forecast the next 2 years periods
forecast <- fit %>%
  forecast(h = 15)
# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposes
plot <- forecast %>%
  autoplot(tourism_ts, alpha = 0.5) +
  labs(title = "Forecast of tourists from Philipines to Zurich",
       x = "Date",
       y = "Number of tourists") + guides(colour = guide_legend(title = "Forecast"))
plot

# Calculate the accuracy of the training set
metrics_ets_AAA <- fit %>% accuracy()

# Display accuracy metrics in an HTML table
metrics_naive %>%
  kable("html", caption = "Model Accuracy Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Model Accuracy Metrics
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
NAIVE Training 3.44 135 79.2 -18.8 48.3 0.698 0.629 -0.3

Clearly see here that the confidence interval is too big, almost like a naive forecast

Why trend dp and seaso M ? - Trend is present so A - Seasonality is present and growing over time so Multiplicative was chosen

Click to show code
# comparing several model
fit <- tourism_ts %>%
  model(
        ETS_M_seaso = ETS(value ~ error("A") + trend("A") + season("M")), #multiplicative seasonality
        ETS_M_seaso_Ad = ETS(value ~ error("A") + trend("Ad") + season("M")), #dampted trend
  )

# Forecast the next 2 years periods
forecast <- fit %>%
  forecast(h = 24)
# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposes
plot <- forecast %>%
  autoplot(tourism_ts, level = 90, color = "blue", alpha = 0.5) +
  labs(title = "Forecast of tourists from Philipines to Zurich",
       x = "Date",
       y = "Number of tourists") + guides(colour = guide_legend(title = "Forecast"))
plot

# Calculate the accuracy of the training set
metrics_ets <- fit %>% accuracy()


# Display accuracy metrics in an HTML table
metrics_ets %>%
  kable("html", caption = "Model Accuracy Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Model Accuracy Metrics
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
ETS_M_seaso Training 2.93 85.7 55.3 -81.5 102 0.487 0.399 0.040
ETS_M_seaso_Ad Training 2.06 74.7 48.9 -78.7 105 0.431 0.348 0.181

5.2.1.2.1 Thus Chosen Model :
Click to show code
fit <- tourism_ts %>%
  model(ETS_M_seaso = ETS(value ~ error("A") + trend("Ad") + season("M"))) #multiplicative seasonality)
# Forecast the next 2 years periods
forecast <- fit %>%
  forecast(h = 24)
# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposes
plot <- forecast %>%
  autoplot(tourism_ts, alpha = 0.5) +
  labs(title = "Forecast of tourists from Philipines to Zurich",
       x = "Date",
       y = "Number of tourists") + guides(colour = guide_legend(title = "Forecast"))
plot

5.2.1.3 ARIMA

Question : Do we need to differentiate the data ?

Click to show code
# Fit an automatic ARIMA model
fit_arima <- tourism_ts %>%
  model(ARIMA_auto = ARIMA(value))

# Forecast the next 2 years (24 months)
forecast_arima <- fit_arima %>%
  forecast(h = 24)

# Plot the forecasts along with the historical data
plot_arima <- forecast_arima %>%
  autoplot(tourism_ts, alpha = 0.5) +
  labs(title = "ARIMA Forecast of Tourists from the Philippines to Zurich",
       x = "Date",
       y = "Number of Tourists") +
  guides(colour = guide_legend(title = "Forecast"))
plot_arima

# Calculate the accuracy of the training set
metrics_arima <- fit_arima %>% accuracy()


# Display accuracy metrics in an HTML table
metrics_arima %>%
  kable("html", caption = "Model Accuracy Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Model Accuracy Metrics
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
ARIMA_auto Training 6.74 113 67.7 -68.7 123 0.597 0.526 -0.024

Ugly forecast, confidence interval is too big

Click to show code
# using auto.arima with stepwise and approximation options turned off for a more thorough search
fit_updated <- auto.arima(tourism_ts, seasonal = TRUE, stepwise = FALSE, approximation = FALSE)
summary(fit_updated)
#> Series: tourism_ts 
#> ARIMA(4,1,0)(1,0,0)[12] 
#> 
#> Coefficients:
#>          ar1     ar2     ar3     ar4   sar1
#>       -0.407  -0.161  -0.236  -0.270  0.462
#> s.e.   0.064   0.071   0.073   0.066  0.074
#> 
#> sigma^2 = 12436:  log likelihood = -1373
#> AIC=2758   AICc=2758   BIC=2778
#> 
#> Training set error measures:
#>                ME RMSE  MAE   MPE MAPE  MASE   ACF1
#> Training set 5.55  110 66.9 -77.7  147 0.589 -0.041

5.2.2 Forecast but dealing with Covid now

5.2.2.1 Special Event- Covid impact

Explanation

5.2.2.1.1 How philipines dealt with it

In 2021, the Philippines faced the challenges of the COVID-19 pandemic with a mix of resilience and adaptation.

CHANGE THIS TEXT ITS BULLSHIT ***** - Lockdowns and Waves: The country experienced two waves of COVID-19, leading to prolonged lockdowns throughout the year. These restrictions aimed to curb the spread of the virus. - Vaccination Campaign: Despite challenges, millions of Filipinos received COVID-19 vaccines. However, experts raised concerns about the campaign’s sluggish pace. - Senate Investigations: Lawmakers probed alleged anomalies in pandemic contracts, leading to marathon Senate hearings. - Easing Restrictions: Towards the end of the year, daily cases dropped, and mandatory face shield policies were lifted. This signaled progress in overcoming the crisis. - Risk-Based Decisions: While the holidays looked promising, the threat of new variants remained. Experts advised practicing “risk-based” decisions for activities despite low case numbers1. - Filipinos have become more mindful of hygiene practices, including social distancing, mask-wearing, and proper handwashing. The pandemic also prompted a shift in consumption patterns, with increased focus on essentials and at-home entertainment. However, air travel remains limited due to ongoing concerns.

As for travel, the Philippines continues to navigate the balance between safety and economic recovery. While some travel restrictions have eased, travelers must stay informed about evolving guidelines and exercise caution when planning trips. The threat of new variants underscores the need for vigilance and informed decision-making1 ******

Question : How to deal with this blackswan event ?

5.2.2.1.2 Covariates

TEXT A PAUFINER Covariate Adjustment Adjust your forecasts to account for the impact of COVID-19 by including covariates that capture the effects of the pandemic. These covariates can be used to adjust the forecasts based on the observed changes in the data due to COVID-19. For example, you can include a covariate that captures the effect of lockdowns or travel restrictions on tourism data. Scenario Analysis Conduct scenario analysis to explore the potential impact of different COVID-19 scenarios on your forecasts. By considering a range of possible outcomes, you can better prepare for the uncertainties associated with the pandemic. Sensitivity Analysis Evaluate the sensitivity of your forecasts to changes in key assumptions or parameters. By conducting sensitivity analysis, you can identify the factors that have the greatest impact on your forecasts and assess the robustness of your models.

5.2.2.1.2.1 Data Integration

The Oxford COVID-19 Government Response Tracker (OxCGRT) provides valuable data on government responses to the COVID-19 pandemic, including a Stringency Index that quantifies the severity of lockdown measures. Let me provide more details about this index:

Stringency Index: The Stringency Index is a composite measure that evaluates the strictness of government policies related to COVID-19. It calculates a score based on nine key response metrics: - School closures - Workplace closures - Cancellation of public events - Restrictions on public gatherings - Closures of public transport - Stay-at-home requirements - Public information campaigns - Restrictions on internal movements - International travel controls Each metric is assigned a value between 0 and 100, with a higher score indicating a stricter response. The overall Stringency Index is the mean score of these nine metrics. If policies vary at the subnational level, the index reflects the strictest sub-region’s response level. Importantly, the Stringency Index records the strictness of government policies but does not measure or imply the appropriateness or effectiveness of a country’s response. A higher score does not necessarily mean that a country’s response is better than others lower on the index.

source - Our World In Data

Click to show code
# Convert data to a time series object
tourism_ts <- tourism_data_zurich_philippines
#rename Date as date
names(tourism_ts)[names(tourism_ts) == "Date"] <- "date"
  
#read .csv with stringency index
stringency_df <- read.csv(here("data/stringency_index.csv"))

# Filter data by location
stringency_philippines <- filter(stringency_df, location == "Philippines")
stringency_switzerland <- filter(stringency_df, location == "Switzerland")

# Convert dates and set them to the first day of the month
stringency_philippines$date <- as.Date(format(dmy(stringency_philippines$date), "%Y-%m-01"))
stringency_switzerland$date <- as.Date(format(dmy(stringency_switzerland$date), "%Y-%m-01"))

# Aggregate to monthly average, ensuring date format is maintained
stringency_philippines <- stringency_philippines %>%
  group_by(date) %>%
  summarize(avg_stringency_index = mean(stringency_index, na.rm = TRUE))

stringency_switzerland <- stringency_switzerland %>%
  group_by(date) %>%
  summarize(avg_stringency_index = mean(stringency_index, na.rm = TRUE))

# Merge with Philippines data first
merged_data_philippines <- merge(tourism_ts, stringency_philippines, by = "date", all.x = TRUE)

# Then merge with Switzerland data
merged_data <- merge(merged_data_philippines, stringency_switzerland, by = "date", all.x = TRUE, suffixes = c("_PH", "_SW"))

# Replace NA values in avg_stringency_index with 0 if necessary
merged_data$avg_stringency_index_PH[is.na(merged_data$avg_stringency_index_PH)] <- 0
merged_data$avg_stringency_index_SW[is.na(merged_data$avg_stringency_index_SW)] <- 0


# Create a ggplot of the stringency index
ggplot(merged_data, aes(x = date, y = avg_stringency_index_PH, color = "Philippines")) +
  geom_line() +
  geom_line(aes(y = avg_stringency_index_SW, color = "Switzerland")) +
  labs(title = "Stringency Index in the Philippines and Switzerland",
       x = "Date",
       y = "Stringency Index") +
  scale_color_manual(values = c("#3C5B6F", "darkred"),
                     labels = c("Philippines", "Switzerland"))

#show merge data using reactable
reactable(merged_data)

5.2.2.1.2.2 Model Selection

Choose a forecasting model that can incorporate exogenous variables (covariates). Models such as:

  • Multiple Regression Analysis: Simple yet effective, if the relationships between the covariates and the dependent variable (tourist numbers) are linear.
Click to show code
# Fit a multiple regression model
model <- lm(value ~ avg_stringency_index_PH + avg_stringency_index_SW, data = merged_data)

# Summary of the model
summary(model)
#> 
#> Call:
#> lm(formula = value ~ avg_stringency_index_PH + avg_stringency_index_SW, 
#>     data = merged_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -318.9 -157.3  -69.3   93.7  873.7 
#> 
#> Coefficients:
#>                         Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)               246.32      15.85   15.54   <2e-16 ***
#> avg_stringency_index_PH     5.87       2.55    2.30   0.0221 *  
#> avg_stringency_index_SW   -12.19       3.73   -3.26   0.0013 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 220 on 222 degrees of freedom
#> Multiple R-squared:  0.0931, Adjusted R-squared:  0.0849 
#> F-statistic: 11.4 on 2 and 222 DF,  p-value: 1.95e-05

# Forecast the next 24 months
forecast_values <- predict(model, newdata = merged_data)

#us gtsummary to show the summary of the model
model %>%
  gtsummary::tbl_regression()
Characteristic Beta 95% CI1 p-value
avg_stringency_index_PH 5.9 0.85, 11 0.022
avg_stringency_index_SW -12 -20, -4.8 0.001
1 CI = Confidence Interval
Click to show code

#create a table with the metrics of the model and show it as an html
model_metrics <- model %>%
  broom::glance()
model_metrics
#> # A tibble: 1 x 12
#>   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC
#>       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl>
#> 1    0.0931        0.0849  220.      11.4  1.95e-5     2 -1531. 3070.
#> # i 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
#> #   nobs <int>

# show html table with metrics
model_metrics %>%
  kable("html", caption = "Model Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Model Metrics
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.093 0.085 220 11.4 0 2 -1531 3070 3084 10734309 222 225
  • ARIMAX (Autoregressive Integrated Moving Average with Exogenous Variables): Suitable for handling time series data with external influences.
Click to show code
# Transform to a time series object with frequency 12 (monthly data)
tourism_ts <- ts(merged_data$value, frequency = 12, start = c(2005, 1))  # Adjust the start time as per your data

# Ensure exogenous variables have the same length and frequency
exog_data <- cbind(merged_data$avg_stringency_index_PH, merged_data$avg_stringency_index_SW)

# Check if lengths of tourism_ts and exog_data match
if (length(tourism_ts) == nrow(exog_data)) {
  # Fit an ARIMAX model
  model <- auto.arima(tourism_ts, xreg = exog_data, seasonal = TRUE)
  
  # Summary of the model
  summary(model)
  
  # Forecast the next 24 months, setting future exogenous variables to 0
  future_exog <- matrix(0, nrow = 24, ncol = 2)
  
  forecast_values <- forecast(model, xreg = future_exog, h = 24)
  
  # Plot the forecast along with the actual data using autoplot from the forecast package
  plot_forecast <- autoplot(forecast_values, series = "Forecast") +
    autolayer(tourism_ts, series = "Actual Data") +
    labs(title = "ARIMAX Forecast of Tourists from the Philippines to Zurich",
         x = "Date",
         y = "Number of Tourists") +
    guides(colour = guide_legend(title = "Data Type"))
  
  print(plot_forecast)
  
  # Print forecasted values
  print(forecast_values)
  
  # Calculate evaluation metrics on the training data
  residuals <- residuals(model)
  mae <- mean(abs(residuals))
  mape <- mean(abs(residuals / tourism_ts)) * 100
  rmse <- sqrt(mean(residuals^2))
  aic <- AIC(model)
  bic <- BIC(model)
  
  # Print evaluation metrics
  cat("Evaluation Metrics:\n")
  cat("MAE:", mae, "\n")
  cat("MAPE:", mape, "\n")
  cat("RMSE:", rmse, "\n")
  cat("AIC:", aic, "\n")
  cat("BIC:", bic, "\n")
} else {
  stop("Length of tourism_ts and exog_data do not match. Please check the data.")
}
#>          Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
#> Oct 2023            969   827  1111   752  1186
#> Nov 2023            895   745  1046   665  1126
#> Dec 2023           1109   956  1261   875  1342
#> Jan 2024            744   590   898   509   980
#> Feb 2024            705   551   860   469   942
#> Mar 2024            790   634   945   552  1028
#> Apr 2024           1092   936  1248   853  1331
#> May 2024           1042   885  1199   802  1282
#> Jun 2024           1024   866  1181   782  1265
#> Jul 2024            995   836  1153   753  1237
#> Aug 2024            789   629   948   545  1032
#> Sep 2024            883   723  1043   639  1128
#> Oct 2024            895   715  1076   619  1171
#> Nov 2024            901   716  1086   619  1184
#> Dec 2024            905   719  1092   620  1191
#> Jan 2025            909   721  1097   621  1197
#> Feb 2025            913   723  1103   623  1203
#> Mar 2025            917   726  1108   624  1209
#> Apr 2025            920   728  1113   626  1214
#> May 2025            924   730  1118   628  1220
#> Jun 2025            927   733  1122   629  1226
#> Jul 2025            931   735  1127   631  1231
#> Aug 2025            935   737  1132   633  1237
#> Sep 2025            938   740  1137   634  1243
#> Evaluation Metrics:
#> MAE: 68.2 
#> MAPE: 68.7 
#> RMSE: 109 
#> AIC: 2757 
#> BIC: 2781

#show metric in html
model_metrics <- data.frame(
  Model = "ARIMAX",
  MAE = mae,
  MAPE = mape,
  RMSE = rmse,
  AIC = aic,
  BIC = bic
)

#show html table with metrics
metrics_arimax <- model_metrics %>%
  kable("html", caption = "Model Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
metrics_arimax
Model Metrics
Model MAE MAPE RMSE AIC BIC
ARIMAX 68.2 68.7 109 2757 2781

  • State Space Models/Structural Time Series Model: These models are flexible in handling multiple covariates with varying impacts over time.
5.2.2.1.3 Incorporating Dummy Variables

Introduce dummy variables into your forecasting models to account for the impact of COVID-19. These variables can be set to 1 for the periods affected by the pandemic and 0 otherwise. This approach allows the model to differentiate the impact of COVID-19 from normal variations in the data.

5.2.2.2 Other ideas

  • Replacing covid values with the ARIMA, If the missing values are random or if excluding them would result in a loss of valuable information, we might consider imputing them. One common approach is to use statistical models like ARIMA to interpolate missing values based on the patterns observed in the available data.
  • Delete the timestamp of the covid period and use fill_gaps() to fill the missing values and then use the a model to predict the missing values.
  • If we use the ARIMA model, we can use covariates to account for the impact of COVID-19. By including a dummy variable that captures the effect of the pandemic, we can adjust the forecasts to reflect the changes in the data due to COVID-19.

6 Model Selection

Use Information Criteria for Model Comparison: Evaluate models based on criteria such as AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion), which help in selecting a model that balances goodness of fit with complexity.

Click to show code
metrics_naive %>%
  kable("html", caption = "Model Accuracy Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Model Accuracy Metrics
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
NAIVE Training 3.44 135 79.2 -18.8 48.3 0.698 0.629 -0.3
Click to show code
metrics_ets %>%
  kable("html", caption = "Model Accuracy Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Model Accuracy Metrics
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
ETS_M_seaso Training 2.93 85.7 55.3 -81.5 102 0.487 0.399 0.040
ETS_M_seaso_Ad Training 2.06 74.7 48.9 -78.7 105 0.431 0.348 0.181
Click to show code
metrics_arima %>%
  kable("html", caption = "Model Accuracy Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Model Accuracy Metrics
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
ARIMA_auto Training 6.74 113 67.7 -68.7 123 0.597 0.526 -0.024
Click to show code
metrics_arimax
Model Metrics
Model MAE MAPE RMSE AIC BIC
ARIMAX 68.2 68.7 109 2757 2781